The dataset (https://www.kaggle.com/fedesoriano/stroke-prediction-dataset) we have chosen is hosted on Kaggle and contains information about age, gender, bmi (and 7 other parameters) along with whether or not the patient has had a stroke. Each of the 5110 rows describes one patient through 12 attributes: id, gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status and stroke (whether the patient has had a stroke).

The location where these data were collected, as well as the period of collection, remain undisclosed.

The data were collected because stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths according to the World Health Organization (WHO).

The dataset is released under the licence “Data files © Original Authors”, meaning that it does not belong to the Kaggle uploader. The original source, as stated on Kaggle, is confidential, and use of this dataset is restricted to educational purposes only.

# Reading in the data
stroke <- read.csv("healthcare-dataset-stroke-data.csv")

# Transforming bmi from character to numeric
stroke$bmi <- as.numeric(stroke$bmi)
## Warning: NAs introduced by coercion
# DATA CLEANING
sum(is.na(stroke$bmi))
## [1] 201
# Since only 201 bmi observations are NA
# (201/5110 * 100 = 3.93% of all rows),
# we can safely omit these rows when cleaning the data.

stroke <- na.omit(stroke)
# we have 4909 valid observations in the stroke dataset
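The effect of na.omit() can be sanity-checked on a toy data frame; this is a minimal illustration with made-up values, not the stroke data:

```r
# Sketch: na.omit() drops every row containing at least one NA
toy <- data.frame(age = c(30, 45, 60), bmi = c(22.1, NA, 28.4))
clean <- na.omit(toy)
nrow(clean)            # 2 rows survive; the row with the missing bmi is gone
sum(is.na(clean$bmi))  # 0
```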

1. Introduction

Outline your main research question.

Research Question: What are the variables that best allow us to predict stroke for an individual? Is it possible to predict stroke using these variables?

2. EDA

Do some exploratory data analysis to tell an interesting story about your data. Instead of limiting yourself to relationships between just two variables, broaden the scope of your analysis and employ creative approaches that evaluate relationships between more than two variables.

We carried out EDA in order to answer the following subsidiary questions about the dataset:

a. The dataset contains mostly binary variables. Is the dataset balanced with respect to these binary variables?

# GENDER
barplot(table(stroke$gender),
        ylim = c(0, 3000),
        col = c("pink", "blue", "gray"),
        main ="Barplot of Gender")

# Roughly 2000 males and 3000 females, and very few labeled as "other"

# HYPERTENSION
barplot(table(stroke$hypertension),
        ylim = c(0, 5000),
        col = c("green","red"),
        names = c("no hypertension", "hypertension"),
        main ="Barplot of Hypertension")

# A lot more people without hypertension than with hypertension, as would be expected in a normal population.

# HEART_DISEASE
barplot(table(stroke$heart_disease),
        ylim = c(0, 5000),
        col = c("green","red"),
        names = c("no heart disease", "heart disease"),
        main ="Barplot of Heart Disease")

# Again, a lot more people without heart disease than with heart disease, which is a reasonable result.

# EVER_MARRIED
barplot(table(stroke$ever_married),
        ylim = c(0, 4000),
        col = c("red", "green"),
        main = "Barplot of Ever Married")

# We have more married people than not married, but the importance of this variable is unclear.

# RESIDENCE_TYPE
barplot(table(stroke$Residence_type),
        ylim = c(0, 3000),
        col = c("green", "gray"),
        main = "Barplot of Residence Type")

# The residence type is pretty even, half urban and half rural.

# SMOKING STATUS
barplot(table(stroke$smoking_status),
        ylim = c(0, 2000),
        cex.names = 0.71,
        col = c("orange", "green", "red", "gray"),
        main = "Barplot of Smoking Status")

# Most people in this dataset either never smoked or their status is unknown. 
# Less than 2000 people smoke or formerly smoked.

# Finally, our main variable of interest:
# STROKE:
barplot(table(stroke$stroke),
        ylim = c(0, 5000),
        col = c("green", "red"),
        names= c("no stroke", "stroke"),
        main = "Barplot of Stroke")

# This dataset is clearly imbalanced with respect to stroke:
# far more people have never suffered a stroke than have, which is consistent with reality.
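Rather than judging balance by eye from bar heights, class proportions can be computed directly with prop.table(). A minimal sketch with hypothetical 0/1 counts (with the real data one would pass table(stroke$stroke)):

```r
# Sketch: quantify class balance with prop.table()
# (hypothetical counts, not the actual stroke tallies)
outcome <- rep(c(0, 1), times = c(955, 45))
round(prop.table(table(outcome)) * 100, 1)
# outcome
#    0    1
# 95.5  4.5
```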

b. Is there any natural segmentation between people who have had a stroke or not based on their lifestyle habits (e.g., BMI, glucose level, etc.)?

We can perform K-means clustering (in 3D) on the continuous variables:

library(ggpubr)
## Loading required package: ggplot2
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Continuous variables only, with any remaining missing values removed
select <- na.omit(stroke[, c("bmi", "age", "avg_glucose_level")])

# SCALING VARIABLES
select$bmi <- scale(select$bmi)
select$age <- scale(select$age)
select$avg_glucose_level <- scale(select$avg_glucose_level)

select_matrix <- as.matrix(select)
fviz_nbclust(select_matrix, kmeans, method = "wss") + geom_vline(xintercept = 4, linetype = 2)
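For reference, the within-cluster sum of squares curve that fviz_nbclust draws can be reproduced with base R alone. A sketch on synthetic data standing in for select_matrix:

```r
# Sketch: elbow method by hand - total within-cluster sum of squares for k = 1..6
set.seed(1)
m <- matrix(rnorm(300), ncol = 3)  # synthetic stand-in for select_matrix
wss <- sapply(1:6, function(k) kmeans(m, centers = k, nstart = 10)$tot.withinss)
wss  # generally decreasing; the "elbow" marks where extra clusters stop helping
```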

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
set.seed(1)  # kmeans starts from random centroids, so fix the seed for reproducibility
select$cluster <- factor(kmeans(select, 4)$cluster)
p <- plot_ly(select, x = ~bmi, y = ~age,
             z = ~avg_glucose_level, color = ~cluster) %>%
  add_markers(size = 1.5)
htmltools::tagList(list(p))

No prominent clusters emerge, so there does not appear to be any natural segmentation of the participants of this study based on these variables.
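One caveat worth keeping in mind: kmeans() is sensitive to its random starting centroids, so different runs can yield different partitions. A minimal sketch on synthetic data of a reproducible call (set.seed fixes the start; nstart reruns the algorithm several times and keeps the best solution):

```r
# Sketch: reproducible k-means on synthetic scaled data
set.seed(1)
m <- scale(matrix(rnorm(300), ncol = 3))   # 100 synthetic observations, 3 variables
km <- kmeans(m, centers = 4, nstart = 25)  # 25 random starts, best result kept
table(km$cluster)                          # cluster sizes
```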

c. Is there any significant linear relationship between the continuous variables in this dataset (age, bmi, avg_glucose level)?

We can try to compute a scatterplot matrix in order to observe the relationships between the continuous variables in this dataset.

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
stroke_continuous <- stroke[, c("age","bmi","avg_glucose_level")]
stroke_continuous <- na.omit(stroke_continuous)
ggpairs(stroke_continuous, ggplot2::aes(colour = "red"))

There are no obviously strong linear relationships among the continuous variables bmi, average glucose level and age. Thus, when we fit a model with stroke as the dependent variable, multicollinearity should not be a concern. To confirm this, let us compute the correlation matrix with the other numerical predictors included:

stroke_1 <- stroke[, -c(1, 2, 6, 7, 8, 11)]  # drop id and the categorical columns
cor(stroke_1)
##                         age hypertension heart_disease avg_glucose_level
## age               1.0000000    0.2744249    0.25712278         0.2358382
## hypertension      0.2744249    1.0000000    0.11599099         0.1805427
## heart_disease     0.2571228    0.1159910    1.00000000         0.1545251
## avg_glucose_level 0.2358382    0.1805427    0.15452512         1.0000000
## bmi               0.3333980    0.1678106    0.04135744         0.1755022
## stroke            0.2323309    0.1425146    0.13793779         0.1389359
##                          bmi     stroke
## age               0.33339800 0.23233086
## hypertension      0.16781058 0.14251461
## heart_disease     0.04135744 0.13793779
## avg_glucose_level 0.17550218 0.13893586
## bmi               1.00000000 0.04237366
## stroke            0.04237366 1.00000000

The most strongly correlated pair of variables is bmi and age, followed by age and hypertension. These relationships can be visualized more conveniently as follows:

library(GGally)
ggcorr(stroke[,-1])
## Warning in ggcorr(stroke[, -1]): data in column(s) 'gender', 'ever_married',
## 'work_type', 'Residence_type', 'smoking_status' are not numeric and were ignored

We see that the highest correlation of stroke with another variable is that with age, even though they are only mildly (positively) correlated.

d. Are older people/people with higher bmis/people with higher average glucose level more likely to suffer from stroke?

boxplot(age ~ stroke, 
        data = stroke, 
        main = "Boxplot of Age vs Stroke", 
        col = c("green", "yellow"))

From this side-by-side boxplot, it seems like older people suffer from stroke more often than younger people, as we would expect in real life.

boxplot(bmi ~ stroke, 
        data = stroke, 
        main = "Boxplot of bmi vs Stroke", 
        col = c("lightblue", "blue"))

There does not seem to be a meaningful difference in BMI between people who have had a stroke and those who have not. However, the BMI distribution of people who have never had a stroke contains far more outliers on the higher end than the distribution of people who have had one.

boxplot(avg_glucose_level ~ stroke, 
        data = stroke, 
        main = "Boxplot of Average Glucose Level vs Stroke", 
        col = c("pink", "red"))

It seems that, in general, the people that have higher average blood glucose levels tend to suffer from stroke more often than people who have lower glucose levels.
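The boxplot impression can be backed by a formal test. Since the assignment asks for a hypothesis test where applicable, one option is a two-sample Wilcoxon rank-sum test of glucose level by stroke status; this is a sketch on synthetic values (with the real data one would write wilcox.test(avg_glucose_level ~ stroke, data = stroke)):

```r
# Sketch: testing whether average glucose differs by stroke status
# (synthetic values; the group difference is built in by construction)
set.seed(1)
toy <- data.frame(stroke  = rep(c(0, 1), each = 50),
                  glucose = c(rnorm(50, mean = 100, sd = 15),
                              rnorm(50, mean = 130, sd = 40)))
pval <- wilcox.test(glucose ~ stroke, data = toy)$p.value
pval  # small here, since the two groups were generated with different means
```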

3. Research question

Use one of your research questions, or come up with a new one depending on feedback from the proposal. If applicable, perform a hypothesis test to answer your question. If you do, make sure to check any applicable assumptions. If you do not use a hypothesis test, use other means to show what answer the data provide.

Since our main research question is “What are the variables that best allow us to predict stroke for an individual?”, we will perform model selection using three methods:

  1. Lasso regression with the most appropriate value for lambda

  2. Backward elimination based on p-values

  3. Forward selection based on p-values

Lasso Regression for Model Selection

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1
# Generating model matrix
X <- model.matrix(stroke ~ gender + age + hypertension + heart_disease + ever_married + work_type + Residence_type + avg_glucose_level + bmi + smoking_status, data = stroke)[,-1]
# `[,-1]` removes the first column: The first column corresponds to an intercept.
# `glmnet` also adds an intercept, therefore we remove it here.

# Finding the best value for lambda

set.seed(1)
cv_result <- cv.glmnet(x = X, y = stroke$stroke, family = "binomial", keep = TRUE)
plot(cv_result)

cat("The value of lambda that yielded the lowest cross-validation error:", cv_result$lambda.min)
## The value of lambda that yielded the lowest cross-validation error: 0.001500674
lasso_fit <- glmnet(x = X, y = stroke$stroke, lambda = cv_result$lambda.min, family = "binomial")
coef(lasso_fit)
## 17 x 1 sparse Matrix of class "dgCMatrix"
##                                      s0
## (Intercept)                -7.522473918
## genderMale                  .          
## genderOther                 .          
## age                         0.066255878
## hypertension                0.504055849
## heart_disease               0.351618296
## ever_marriedYes             .          
## work_typeGovt_job           .          
## work_typeNever_worked       .          
## work_typePrivate            0.084822677
## work_typeSelf-employed     -0.199362826
## Residence_typeUrban         .          
## avg_glucose_level           0.004446061
## bmi                         .          
## smoking_statusnever smoked  .          
## smoking_statussmokes        0.245736020
## smoking_statusUnknown      -0.110532397

Lasso regression has selected the following attributes: age, hypertension, heart_disease, work_type, avg_glucose_level and smoking_status.

Even though the coefficients for some levels of smoking_status and work_type were shrunk to 0, we decided to include in the model any categorical attribute with at least one level whose coefficient is non-zero.
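As a side note, the selected variables can be pulled out programmatically instead of being read off by eye. A base-R sketch on a hypothetical coefficient vector (with the actual fit, one would start from coef(lasso_fit)):

```r
# Sketch: keep names of non-zero, non-intercept coefficients
# (hypothetical values mimicking the sparse lasso output)
coefs <- c("(Intercept)" = -7.52, genderMale = 0, age = 0.066,
           hypertension = 0.504, bmi = 0, avg_glucose_level = 0.004)
selected <- names(coefs)[coefs != 0 & names(coefs) != "(Intercept)"]
selected  # "age" "hypertension" "avg_glucose_level"
```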

Backward Elimination Based on p-values

Note that the glm() calls below omit the family argument, so they default to gaussian: these are linear probability models fit to the 0/1 stroke indicator (the output confirms “Dispersion parameter for gaussian family”), in contrast to the logistic models used for forward selection.

back <- glm(stroke ~ 
              gender + 
              age + 
              hypertension + 
              heart_disease + 
              work_type  + 
              avg_glucose_level + 
              smoking_status + 
              ever_married + 
              Residence_type + 
              bmi, data = stroke)
summary(back)
## 
## Call:
## glm(formula = stroke ~ gender + age + hypertension + heart_disease + 
##     work_type + avg_glucose_level + smoking_status + ever_married + 
##     Residence_type + bmi, data = stroke)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.29253  -0.06862  -0.02089   0.00605   1.00712  
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -2.919e-02  1.589e-02  -1.837 0.066294 .  
## genderMale                 -2.382e-03  5.701e-03  -0.418 0.676089    
## genderOther                -2.662e-02  1.942e-01  -0.137 0.890983    
## age                         2.632e-03  2.079e-04  12.661  < 2e-16 ***
## hypertension                4.857e-02  1.014e-02   4.788 1.74e-06 ***
## heart_disease               5.635e-02  1.344e-02   4.194 2.79e-05 ***
## work_typeGovt_job          -5.954e-02  1.455e-02  -4.093 4.33e-05 ***
## work_typeNever_worked      -2.561e-02  4.232e-02  -0.605 0.545171    
## work_typePrivate           -4.582e-02  1.221e-02  -3.752 0.000177 ***
## work_typeSelf-employed     -6.541e-02  1.483e-02  -4.411 1.05e-05 ***
## avg_glucose_level           3.274e-04  6.561e-05   4.991 6.23e-07 ***
## smoking_statusnever smoked -2.048e-03  8.209e-03  -0.249 0.803014    
## smoking_statussmokes        6.056e-03  9.906e-03   0.611 0.540975    
## smoking_statusUnknown      -6.041e-03  9.338e-03  -0.647 0.517685    
## ever_marriedYes            -2.898e-02  8.214e-03  -3.528 0.000422 ***
## Residence_typeUrban         1.642e-03  5.542e-03   0.296 0.767057    
## bmi                        -6.135e-04  4.041e-04  -1.518 0.129006    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03758521)
## 
##     Null deviance: 200.10  on 4908  degrees of freedom
## Residual deviance: 183.87  on 4892  degrees of freedom
## AIC: -2157
## 
## Number of Fisher Scoring iterations: 2
# ELIMINATE GENDER FIRST
back <- glm(stroke ~ 
              age + 
              hypertension + 
              heart_disease + 
              work_type  + 
              avg_glucose_level + 
              smoking_status + 
              ever_married + 
              Residence_type + 
              bmi, data = stroke)
summary(back)
## 
## Call:
## glm(formula = stroke ~ age + hypertension + heart_disease + work_type + 
##     avg_glucose_level + smoking_status + ever_married + Residence_type + 
##     bmi, data = stroke)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.29342  -0.06853  -0.02088   0.00578   1.00807  
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -3.044e-02  1.562e-02  -1.949 0.051399 .  
## age                         2.633e-03  2.079e-04  12.666  < 2e-16 ***
## hypertension                4.849e-02  1.014e-02   4.782 1.79e-06 ***
## heart_disease               5.592e-02  1.339e-02   4.175 3.03e-05 ***
## work_typeGovt_job          -5.922e-02  1.452e-02  -4.079 4.59e-05 ***
## work_typeNever_worked      -2.564e-02  4.231e-02  -0.606 0.544604    
## work_typePrivate           -4.552e-02  1.218e-02  -3.738 0.000188 ***
## work_typeSelf-employed     -6.505e-02  1.480e-02  -4.396 1.12e-05 ***
## avg_glucose_level           3.260e-04  6.552e-05   4.976 6.71e-07 ***
## smoking_statusnever smoked -1.785e-03  8.185e-03  -0.218 0.827377    
## smoking_statussmokes        6.146e-03  9.900e-03   0.621 0.534750    
## smoking_statusUnknown      -5.924e-03  9.330e-03  -0.635 0.525508    
## ever_marriedYes            -2.897e-02  8.211e-03  -3.528 0.000423 ***
## Residence_typeUrban         1.665e-03  5.540e-03   0.300 0.763815    
## bmi                        -6.137e-04  4.039e-04  -1.519 0.128712    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03757132)
## 
##     Null deviance: 200.10  on 4908  degrees of freedom
## Residual deviance: 183.87  on 4894  degrees of freedom
## AIC: -2160.8
## 
## Number of Fisher Scoring iterations: 2
# ELIMINATE SMOKING STATUS
back <- glm(stroke ~ 
              age + 
              hypertension + 
              heart_disease + 
              work_type  + 
              avg_glucose_level + 
              ever_married + 
              Residence_type + 
              bmi, data = stroke)
summary(back)
## 
## Call:
## glm(formula = stroke ~ age + hypertension + heart_disease + work_type + 
##     avg_glucose_level + ever_married + Residence_type + bmi, 
##     data = stroke)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.29500  -0.06793  -0.02047   0.00544   1.01518  
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -3.635e-02  1.245e-02  -2.919 0.003532 ** 
## age                     2.630e-03  2.060e-04  12.767  < 2e-16 ***
## hypertension            4.901e-02  1.011e-02   4.846 1.30e-06 ***
## heart_disease           5.669e-02  1.337e-02   4.239 2.28e-05 ***
## work_typeGovt_job      -5.514e-02  1.381e-02  -3.992 6.65e-05 ***
## work_typeNever_worked  -2.357e-02  4.211e-02  -0.560 0.575644    
## work_typePrivate       -4.154e-02  1.139e-02  -3.648 0.000268 ***
## work_typeSelf-employed -6.120e-02  1.417e-02  -4.321 1.59e-05 ***
## avg_glucose_level       3.270e-04  6.548e-05   4.994 6.13e-07 ***
## ever_marriedYes        -2.842e-02  8.198e-03  -3.467 0.000531 ***
## Residence_typeUrban     1.872e-03  5.536e-03   0.338 0.735268    
## bmi                    -5.998e-04  4.036e-04  -1.486 0.137309    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03756085)
## 
##     Null deviance: 200.10  on 4908  degrees of freedom
## Residual deviance: 183.94  on 4897  degrees of freedom
## AIC: -2165.2
## 
## Number of Fisher Scoring iterations: 2
# ELIMINATE RESIDENCE TYPE
back <- glm(stroke ~ 
              age + 
              hypertension + 
              heart_disease + 
              work_type  + 
              avg_glucose_level + 
              ever_married + 
              bmi, data = stroke)
summary(back)
## 
## Call:
## glm(formula = stroke ~ age + hypertension + heart_disease + work_type + 
##     avg_glucose_level + ever_married + bmi, data = stroke)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.29591  -0.06812  -0.02022   0.00539   1.01425  
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -3.539e-02  1.213e-02  -2.918 0.003535 ** 
## age                     2.631e-03  2.060e-04  12.774  < 2e-16 ***
## hypertension            4.900e-02  1.011e-02   4.845 1.31e-06 ***
## heart_disease           5.666e-02  1.337e-02   4.238 2.30e-05 ***
## work_typeGovt_job      -5.514e-02  1.381e-02  -3.992 6.64e-05 ***
## work_typeNever_worked  -2.325e-02  4.209e-02  -0.552 0.580791    
## work_typePrivate       -4.157e-02  1.139e-02  -3.651 0.000264 ***
## work_typeSelf-employed -6.121e-02  1.416e-02  -4.321 1.58e-05 ***
## avg_glucose_level       3.268e-04  6.547e-05   4.991 6.21e-07 ***
## ever_marriedYes        -2.842e-02  8.197e-03  -3.467 0.000530 ***
## bmi                    -5.998e-04  4.036e-04  -1.486 0.137320    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03755406)
## 
##     Null deviance: 200.10  on 4908  degrees of freedom
## Residual deviance: 183.94  on 4898  degrees of freedom
## AIC: -2167.1
## 
## Number of Fisher Scoring iterations: 2
# ELIMINATE BMI
back <- glm(stroke ~ 
              age + 
              hypertension + 
              heart_disease + 
              work_type  + 
              avg_glucose_level + 
              ever_married,
              data = stroke)
summary(back)
## 
## Call:
## glm(formula = stroke ~ age + hypertension + heart_disease + work_type + 
##     avg_glucose_level + ever_married, data = stroke)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.29470  -0.06823  -0.02127   0.00636   1.01452  
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -4.633e-02  9.638e-03  -4.807 1.58e-06 ***
## age                     2.640e-03  2.059e-04  12.823  < 2e-16 ***
## hypertension            4.748e-02  1.006e-02   4.718 2.45e-06 ***
## heart_disease           5.717e-02  1.337e-02   4.277 1.93e-05 ***
## work_typeGovt_job      -6.046e-02  1.334e-02  -4.532 5.98e-06 ***
## work_typeNever_worked  -2.661e-02  4.204e-02  -0.633 0.526742    
## work_typePrivate       -4.690e-02  1.081e-02  -4.338 1.47e-05 ***
## work_typeSelf-employed -6.624e-02  1.375e-02  -4.816 1.51e-06 ***
## avg_glucose_level       3.146e-04  6.496e-05   4.843 1.32e-06 ***
## ever_marriedYes        -2.973e-02  8.151e-03  -3.648 0.000268 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.03756333)
## 
##     Null deviance: 200.10  on 4908  degrees of freedom
## Residual deviance: 184.02  on 4899  degrees of freedom
## AIC: -2166.9
## 
## Number of Fisher Scoring iterations: 2

The model discovered by backward elimination based on p-values contains: age, hypertension, heart_disease, work_type, avg_glucose_level and ever_married.

These results differ somewhat from the ones obtained via lasso: backward elimination retains ever_married but drops smoking_status, which lasso kept.
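For comparison, base R's step() function automates a similar backward search, though it ranks models by AIC rather than by individual p-values. A minimal sketch on synthetic data (not the stroke fit):

```r
# Sketch: automated backward elimination by AIC with step()
set.seed(2)
toy <- data.frame(y  = rbinom(200, 1, 0.3),  # outcome unrelated to predictors
                  x1 = rnorm(200),
                  x2 = rnorm(200))
full    <- glm(y ~ x1 + x2, data = toy, family = binomial)
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)  # noise predictors are typically dropped
```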

Forward Selection based on p-values

# STEPWISE REGRESSION: FORWARD, BASED ON P-VALUE

summary(glm(stroke ~ gender, data = stroke, family = binomial(link = "logit")))$coefficients
##                Estimate   Std. Error      z value      Pr(>|z|)
## (Intercept) -3.14163474   0.09323859 -33.69457684 6.940867e-249
## genderMale   0.06914952   0.14300238   0.48355506  6.287017e-01
## genderOther -9.42442826 324.74370946  -0.02902113  9.768477e-01
summary(glm(stroke ~ age, data = stroke, family = binomial(link = "logit")))$coefficients
##                Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) -7.37723102 0.362382724 -20.35757 3.978460e-92
## age          0.07496941 0.005317943  14.09745 3.937696e-45
summary(glm(stroke ~ hypertension, data = stroke, family = binomial(link = "logit")))$coefficients
##               Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)  -3.364515 0.08332751 -40.377001 0.000000e+00
## hypertension  1.490152 0.16176429   9.211872 3.204876e-20
summary(glm(stroke ~ heart_disease, data = stroke, family = binomial(link = "logit")))$coefficients
##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)   -3.281267 0.07835514 -41.876860 0.000000e+00
## heart_disease  1.656941 0.18990955   8.724893 2.664315e-18
summary(glm(stroke ~ ever_married, data = stroke, family = binomial(link = "logit")))$coefficients
##                  Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)     -4.292245  0.2099351 -20.445578 6.577429e-93
## ever_marriedYes  1.505642  0.2231153   6.748267 1.496213e-11
summary(glm(stroke ~ work_type, data = stroke, family = binomial(link = "logit")))$coefficients
##                         Estimate Std. Error     z value     Pr(>|z|)
## (Intercept)            -6.507278   1.000746 -6.50242696 7.903440e-11
## work_typeGovt_job       3.439225   1.019249  3.37427432 7.401057e-04
## work_typeNever_worked  -9.058791 310.293410 -0.02919427 9.767096e-01
## work_typePrivate        3.456401   1.004858  3.43969208 5.823764e-04
## work_typeSelf-employed  3.895544   1.010814  3.85386769 1.162664e-04
summary(glm(stroke ~ Residence_type, data = stroke, family = binomial(link = "logit")))$coefficients
##                        Estimate Std. Error     z value      Pr(>|z|)
## (Intercept)         -3.14372115  0.1021333 -30.7805572 4.771318e-208
## Residence_typeUrban  0.05979319  0.1415116   0.4225322  6.726366e-01
summary(glm(stroke ~ avg_glucose_level, data = stroke, family = binomial(link = "logit")))$coefficients
##                      Estimate  Std. Error    z value      Pr(>|z|)
## (Intercept)       -4.43638483 0.175678799 -25.252819 1.054641e-140
## avg_glucose_level  0.01124957 0.001216871   9.244668  2.359720e-20
summary(glm(stroke ~ bmi, data = stroke, family = binomial(link = "logit")))$coefficients
##                Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept) -3.82844339 0.256842054 -14.905828 3.020465e-50
## bmi          0.02415957 0.008128831   2.972084 2.957860e-03
summary(glm(stroke ~ smoking_status, data = stroke, family = binomial(link = "logit")))$coefficients
##                              Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)                -2.6162427  0.1372075 -19.067775 4.677881e-81
## smoking_statusnever smoked -0.4305448  0.1769076  -2.433728 1.494424e-02
## smoking_statussmokes       -0.2684148  0.2142419  -1.252858 2.102572e-01
## smoking_statusUnknown      -1.2985352  0.2323605  -5.588451 2.291035e-08
# AGE HAS THE LOWEST P-VALUE, SO WE WILL CHOOSE THAT
summary(glm(stroke ~ age + gender, data = stroke, family = binomial(link = "logit")))$coefficients
##                Estimate   Std. Error      z value     Pr(>|z|)
## (Intercept) -7.42002030 3.692914e-01 -20.09258722 8.567908e-90
## age          0.07500715 5.325108e-03  14.08556251 4.659339e-45
## genderMale   0.09702711 1.491261e-01   0.65063812 5.152801e-01
## genderOther -7.09622848 3.247438e+02  -0.02185178 9.825662e-01
summary(glm(stroke ~ age + hypertension, data = stroke, family = binomial(link = "logit")))$coefficients
##                 Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)  -7.31855136 0.363897407 -20.111579 5.843407e-90
## age           0.07191301 0.005407977  13.297582 2.390765e-40
## hypertension  0.64765101 0.169993562   3.809856 1.390477e-04
summary(glm(stroke ~ age + heart_disease, data = stroke, family = binomial(link = "logit")))$coefficients
##                  Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)   -7.26864292 0.362784710 -20.035693 2.690704e-89
## age            0.07217707 0.005417203  13.323678 1.685920e-40
## heart_disease  0.51464561 0.200561564   2.566023 1.028720e-02
summary(glm(stroke ~ age + ever_married, data = stroke, family = binomial(link = "logit")))$coefficients
##                    Estimate  Std. Error     z value     Pr(>|z|)
## (Intercept)     -7.32746499 0.392838810 -18.6525995 1.202843e-77
## age              0.07523089 0.005361696  14.0311745 1.004737e-44
## ever_marriedYes -0.07526895 0.238788309  -0.3152121 7.526007e-01
summary(glm(stroke ~ age + work_type, data = stroke, family = binomial(link = "logit")))$coefficients
##                           Estimate   Std. Error     z value     Pr(>|z|)
## (Intercept)            -7.11308109 1.002052e+00 -7.09851313 1.261062e-12
## age                     0.07870812 5.770047e-03 13.64080999 2.290028e-42
## work_typeGovt_job      -0.54620694 1.070862e+00 -0.51006302 6.100073e-01
## work_typeNever_worked  -9.73962475 3.096518e+02 -0.03145348 9.749079e-01
## work_typePrivate       -0.36740304 1.057141e+00 -0.34754405 7.281826e-01
## work_typeSelf-employed -0.80286135 1.077045e+00 -0.74542952 4.560121e-01
summary(glm(stroke ~ age + Residence_type, data = stroke, family = binomial(link = "logit")))$coefficients
##                        Estimate  Std. Error      z value     Pr(>|z|)
## (Intercept)         -7.38289763 0.368617390 -20.02862001 3.101334e-89
## age                  0.07495791 0.005319159  14.09206085 4.249762e-45
## Residence_typeUrban  0.01241056 0.147547409   0.08411238 9.329671e-01
summary(glm(stroke ~ age + avg_glucose_level, data = stroke, family = binomial(link = "logit")))$coefficients
##                       Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)       -7.863851504 0.383221332 -20.520391 1.415697e-93
## age                0.071836016 0.005423383  13.245611 4.783731e-40
## avg_glucose_level  0.005596653 0.001228612   4.555264 5.231984e-06
summary(glm(stroke ~ age + bmi, data = stroke, family = binomial(link = "logit")))$coefficients
##                Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept) -7.94351988 0.542971877 -14.629708 1.815653e-48
## age          0.07609125 0.005469275  13.912493 5.319231e-44
## bmi          0.01624901 0.011122646   1.460895 1.440443e-01
summary(glm(stroke ~ age + smoking_status, data = stroke, family = binomial(link = "logit")))$coefficients
##                               Estimate  Std. Error     z value     Pr(>|z|)
## (Intercept)                -7.39162386 0.403044144 -18.3394895 4.005406e-75
## age                         0.07583030 0.005475945  13.8478947 1.309954e-43
## smoking_statusnever smoked -0.07242036 0.183842287  -0.3939266 6.936352e-01
## smoking_statussmokes        0.30663330 0.225449316   1.3600986 1.737987e-01
## smoking_statusUnknown      -0.38304204 0.241441298  -1.5864810 1.126302e-01
# AVG_GLUCOSE_LEVEL HAS THE LOWEST P-VALUE, SO WE WILL CHOOSE THAT
summary(glm(stroke ~ age + avg_glucose_level + gender, data = stroke, family = binomial(link = "logit")))$coefficients
##                       Estimate   Std. Error     z value     Pr(>|z|)
## (Intercept)       -7.877596737 3.873232e-01 -20.3385629 5.862316e-92
## age                0.071852329 5.426958e-03  13.2398893 5.162488e-40
## avg_glucose_level  0.005567628 1.233785e-03   4.5126402 6.402558e-06
## genderMale         0.039349196 1.506131e-01   0.2612600 7.938920e-01
## genderOther       -7.354634943 3.247438e+02  -0.0226475 9.819315e-01
summary(glm(stroke ~ age + avg_glucose_level + hypertension, data = stroke, family = binomial(link = "logit")))$coefficients
##                       Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)       -7.766874299 0.385207756 -20.162819 2.077042e-90
## age                0.069570876 0.005489880  12.672569 8.392051e-37
## avg_glucose_level  0.005046714 0.001245508   4.051933 5.079616e-05
## hypertension       0.547025603 0.172630396   3.168768 1.530868e-03
summary(glm(stroke ~ age + avg_glucose_level + heart_disease, data = stroke, family = binomial(link = "logit")))$coefficients
##                       Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)       -7.751878533 0.385287556 -20.119722 4.958501e-90
## age                0.069711782 0.005507556  12.657479 1.017131e-36
## avg_glucose_level  0.005328059 0.001238933   4.300522 1.703964e-05
## heart_disease      0.418616918 0.202687145   2.065335 3.889128e-02
summary(glm(stroke ~ age + avg_glucose_level + ever_married, data = stroke, family = binomial(link = "logit")))$coefficients
##                       Estimate  Std. Error     z value     Pr(>|z|)
## (Intercept)       -7.789594046 0.408601787 -19.0640235 5.025727e-81
## age                0.072241969 0.005453879  13.2459789 4.760343e-40
## avg_glucose_level  0.005623909 0.001230790   4.5693476 4.892447e-06
## ever_marriedYes   -0.117295923 0.240601219  -0.4875118 6.258957e-01
summary(glm(stroke ~ age + avg_glucose_level + work_type, data = stroke, family = binomial(link = "logit")))$coefficients
##                            Estimate   Std. Error    z value     Pr(>|z|)
## (Intercept)            -7.619543226 1.008696e+00 -7.5538521 4.225702e-14
## age                     0.075457883 5.886560e-03 12.8186715 1.288899e-37
## avg_glucose_level       0.005533452 1.230581e-03  4.4966153 6.904374e-06
## work_typeGovt_job      -0.514593970 1.072221e+00 -0.4799325 6.312754e-01
## work_typeNever_worked  -9.718833702 3.092018e+02 -0.0314320 9.749250e-01
## work_typePrivate       -0.336037018 1.058422e+00 -0.3174888 7.508728e-01
## work_typeSelf-employed -0.756468281 1.078583e+00 -0.7013539 4.830822e-01
summary(glm(stroke ~ age + avg_glucose_level + Residence_type, data = stroke, family = binomial(link = "logit")))$coefficients
##                         Estimate  Std. Error      z value     Pr(>|z|)
## (Intercept)         -7.869477626 0.388986873 -20.23070232 5.254916e-91
## age                  0.071822532 0.005425147  13.23881776 5.236665e-40
## avg_glucose_level    0.005596978 0.001228693   4.55523058 5.232815e-06
## Residence_typeUrban  0.012527162 0.148357406   0.08443907 9.327073e-01
summary(glm(stroke ~ age + avg_glucose_level + bmi, data = stroke, family = binomial(link = "logit")))$coefficients
##                       Estimate  Std. Error     z value     Pr(>|z|)
## (Intercept)       -8.043403036 0.538823904 -14.9277027 2.176357e-50
## age                0.072270558 0.005531337  13.0656581 5.173525e-39
## avg_glucose_level  0.005454762 0.001262780   4.3196447 1.562806e-05
## bmi                0.005563145 0.011557519   0.4813442 6.302719e-01
summary(glm(stroke ~ age + avg_glucose_level + smoking_status, data = stroke, family = binomial(link = "logit")))$coefficients
##                                Estimate Std. Error     z value     Pr(>|z|)
## (Intercept)                -7.903594326 0.42602447 -18.5519727 7.861314e-77
## age                         0.072903595 0.00557849  13.0686971 4.970964e-39
## avg_glucose_level           0.005478817 0.00123227   4.4461167 8.743648e-06
## smoking_statusnever smoked -0.054455363 0.18515721  -0.2941034 7.686789e-01
## smoking_statussmokes        0.341884635 0.22698826   1.5061776 1.320216e-01
## smoking_statusUnknown      -0.303630346 0.24332968  -1.2478147 2.120989e-01
# HYPERTENSION HAS THE LOWEST P-VALUE, SO WE WILL CHOOSE THAT
summary(glm(stroke ~ age + hypertension + avg_glucose_level + gender, data = stroke, family = binomial(link = "logit")))$coefficients
##                       Estimate   Std. Error      z value     Pr(>|z|)
## (Intercept)       -7.781770240 3.895709e-01 -19.97523508 9.045703e-89
## age                0.069595683 5.493776e-03  12.66809512 8.884573e-37
## hypertension       0.547147323 1.726137e-01   3.16977869 1.525551e-03
## avg_glucose_level  0.005015926 1.250812e-03   4.01013607 6.068377e-05
## genderMale         0.041195282 1.510594e-01   0.27270914 7.850768e-01
## genderOther       -7.312713174 3.247438e+02  -0.02251841 9.820344e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + heart_disease, data = stroke, family = binomial(link = "logit")))$coefficients
##                       Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)       -7.660740480 0.387152022 -19.787422 3.820918e-87
## age                0.067546641 0.005571399  12.123820 7.899008e-34
## hypertension       0.539613446 0.173054857   3.118164 1.819814e-03
## avg_glucose_level  0.004802004 0.001254526   3.827743 1.293239e-04
## heart_disease      0.404298420 0.203446548   1.987246 4.689510e-02
summary(glm(stroke ~ age + hypertension + avg_glucose_level + ever_married, data = stroke, family = binomial(link = "logit")))$coefficients
##                       Estimate  Std. Error     z value     Pr(>|z|)
## (Intercept)       -7.697702775 0.410321307 -18.7601829 1.598484e-78
## age                0.069957658 0.005523751  12.6648831 9.255812e-37
## hypertension       0.546450560 0.172713128   3.1639202 1.556595e-03
## avg_glucose_level  0.005074374 0.001247718   4.0669244 4.763768e-05
## ever_marriedYes   -0.110130436 0.241004702  -0.4569638 6.476970e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + work_type, data = stroke, family = binomial(link = "logit")))$coefficients
##                            Estimate   Std. Error     z value     Pr(>|z|)
## (Intercept)            -7.546497794 1.008936e+00 -7.47966058 7.451481e-14
## age                     0.073295078 5.957261e-03 12.30348668 8.674707e-35
## hypertension            0.559604419 1.729560e-01  3.23553143 1.214165e-03
## avg_glucose_level       0.004977561 1.247178e-03  3.99105953 6.577876e-05
## work_typeGovt_job      -0.485869226 1.072735e+00 -0.45292555 6.506024e-01
## work_typeNever_worked  -9.701252883 3.093385e+02 -0.03136128 9.749814e-01
## work_typePrivate       -0.317674408 1.059079e+00 -0.29995343 7.642127e-01
## work_typeSelf-employed -0.755732210 1.079514e+00 -0.70006707 4.838854e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + Residence_type, data = stroke, family = binomial(link = "logit")))$coefficients
##                         Estimate  Std. Error     z value     Pr(>|z|)
## (Intercept)         -7.775463584 0.391047699 -19.8836705 5.635628e-88
## age                  0.069551061 0.005491172  12.6659771 9.127677e-37
## hypertension         0.547365843 0.172654226   3.1703009 1.522812e-03
## avg_glucose_level    0.005046535 0.001245634   4.0513801 5.091642e-05
## Residence_typeUrban  0.019029741 0.148805758   0.1278831 8.982415e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + bmi, data = stroke, family = binomial(link = "logit")))$coefficients
##                       Estimate  Std. Error     z value     Pr(>|z|)
## (Intercept)       -7.852290780 0.541579006 -14.4988833 1.231369e-47
## age                0.069793224 0.005592872  12.4789605 9.724796e-36
## hypertension       0.543398746 0.173304241   3.1355190 1.715503e-03
## avg_glucose_level  0.004983550 0.001276095   3.9053114 9.410412e-05
## bmi                0.002620905 0.011597749   0.2259839 8.212139e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + smoking_status, data = stroke, family = binomial(link = "logit")))$coefficients
##                                Estimate  Std. Error     z value     Pr(>|z|)
## (Intercept)                -7.817975694 0.428113081 -18.2614735 1.677036e-74
## age                         0.070743756 0.005641453  12.5399888 4.510592e-36
## hypertension                0.531583861 0.173219210   3.0688505 2.148841e-03
## avg_glucose_level           0.004991703 0.001246534   4.0044672 6.215741e-05
## smoking_statusnever smoked -0.073695985 0.185830434  -0.3965765 6.916798e-01
## smoking_statussmokes        0.342527512 0.227521618   1.5054724 1.322027e-01
## smoking_statusUnknown      -0.259185034 0.244351365  -1.0607063 2.888234e-01
# HEART DISEASE HAS THE LOWEST P-VALUE, AND IT IS STILL LESS THAN 0.05, SO WE ADD IT
summary(glm(stroke ~ age + hypertension + avg_glucose_level + heart_disease + gender, data = stroke, family = binomial(link = "logit")))$coefficients
##                       Estimate   Std. Error      z value     Pr(>|z|)
## (Intercept)       -7.663276556 3.923369e-01 -19.53238744 5.825145e-85
## age                0.067550861 5.577181e-03  12.11200717 9.123390e-34
## hypertension       0.539631103 1.730532e-01   3.11829580 1.819001e-03
## avg_glucose_level  0.004797906 1.258026e-03   3.81383810 1.368251e-04
## heart_disease      0.403109319 2.050314e-01   1.96608582 4.928870e-02
## genderMale         0.007158815 1.524657e-01   0.04695361 9.625502e-01
## genderOther       -7.346792655 3.247438e+02  -0.02262335 9.819507e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + heart_disease + ever_married, data = stroke, family = binomial(link = "logit")))$coefficients
##                       Estimate  Std. Error     z value     Pr(>|z|)
## (Intercept)       -7.605991130 0.411245668 -18.4950061 2.265166e-76
## age                0.067895001 0.005623411  12.0736330 1.455631e-33
## hypertension       0.539081598 0.173124647   3.1138351 1.846727e-03
## avg_glucose_level  0.004825268 0.001256749   3.8394829 1.232937e-04
## heart_disease      0.400997394 0.203702340   1.9685458 4.900527e-02
## ever_marriedYes   -0.089925431 0.241799551  -0.3719007 7.099668e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + heart_disease + work_type, data = stroke, family = binomial(link = "logit")))$coefficients
##                            Estimate   Std. Error     z value     Pr(>|z|)
## (Intercept)            -7.505346860 1.009144e+00 -7.43734348 1.027302e-13
## age                     0.071219249 6.056140e-03 11.75984242 6.285149e-32
## hypertension            0.550727122 1.733724e-01  3.17655638 1.490348e-03
## avg_glucose_level       0.004732252 1.256763e-03  3.76542799 1.662640e-04
## heart_disease           0.386448475 2.043436e-01  1.89116977 5.860169e-02
## work_typeGovt_job      -0.405739320 1.073300e+00 -0.37802974 7.054085e-01
## work_typeNever_worked  -9.683999384 3.094148e+02 -0.03129779 9.750321e-01
## work_typePrivate       -0.250174174 1.059412e+00 -0.23614443 8.133206e-01
## work_typeSelf-employed -0.677872289 1.080087e+00 -0.62760916 5.302600e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + heart_disease + Residence_type, data = stroke, family = binomial(link = "logit")))$coefficients
##                         Estimate  Std. Error     z value     Pr(>|z|)
## (Intercept)         -7.673254253 0.392897024 -19.5299373 6.111441e-85
## age                  0.067515342 0.005572265  12.1163186 8.656074e-34
## hypertension         0.540063451 0.173077072   3.1203639 1.806277e-03
## avg_glucose_level    0.004800663 0.001254750   3.8259904 1.302474e-04
## heart_disease        0.405522896 0.203547641   1.9922751 4.634088e-02
## Residence_typeUrban  0.027945581 0.149125639   0.1873962 8.513500e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + heart_disease + bmi, data = stroke, family = binomial(link = "logit")))$coefficients
##                       Estimate  Std. Error     z value     Pr(>|z|)
## (Intercept)       -7.776192003 0.543266749 -14.3137639 1.795278e-46
## age                0.067831111 0.005669895  11.9633815 5.526470e-33
## hypertension       0.534552075 0.173750262   3.0765541 2.094083e-03
## avg_glucose_level  0.004716290 0.001284854   3.6706818 2.419044e-04
## heart_disease      0.406811554 0.203530651   1.9987729 4.563293e-02
## bmi                0.003567092 0.011661870   0.3058765 7.596987e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + heart_disease + smoking_status, data = stroke, family = binomial(link = "logit")))$coefficients
##                                Estimate  Std. Error     z value     Pr(>|z|)
## (Intercept)                -7.715380015 0.430076067 -17.9395707 5.790067e-72
## age                         0.068759594 0.005730639  11.9985921 3.613914e-33
## hypertension                0.522593784 0.173695992   3.0086692 2.623946e-03
## avg_glucose_level           0.004772674 0.001255004   3.8029161 1.430028e-04
## heart_disease               0.368800613 0.204657023   1.8020423 7.153875e-02
## smoking_statusnever smoked -0.056893666 0.186363000  -0.3052841 7.601498e-01
## smoking_statussmokes        0.320623702 0.228448635   1.4034827 1.604730e-01
## smoking_statusUnknown      -0.257303488 0.244818981  -1.0509948 2.932610e-01
# NO OTHER PARAMETERS WITH P-VALUE LESS THAN 0.05. WE WILL STOP HERE.
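The stepwise procedure above can be condensed into a small helper function. This is our own sketch, not code from the original analysis: the helper name `forward_select` is ours, and it is illustrated on simulated data (with variable names mirroring the dataset) rather than on the real CSV.

```r
# Forward selection by p-value (sketch): starting from a base model, repeatedly
# add the candidate whose coefficient has the smallest p-value, stopping once
# no remaining candidate is significant at the 0.05 level.
forward_select <- function(data, base, candidates, alpha = 0.05) {
  repeat {
    pvals <- sapply(candidates, function(v) {
      f <- reformulate(c(base, v), response = "stroke")
      coefs <- summary(glm(f, data = data, family = binomial))$coefficients
      # for factors, take the smallest p-value among the variable's dummy rows
      min(coefs[grep(v, rownames(coefs), fixed = TRUE), 4])
    })
    if (length(pvals) == 0 || min(pvals) >= alpha) break
    best <- names(which.min(pvals))
    base <- c(base, best)
    candidates <- setdiff(candidates, best)
  }
  base
}

# Simulated stand-in for the real dataset (true model uses age, glucose, hypertension)
set.seed(1)
n <- 2000
sim <- data.frame(age               = runif(n, 20, 80),
                  avg_glucose_level = runif(n, 60, 250),
                  hypertension      = rbinom(n, 1, 0.1),
                  bmi               = runif(n, 15, 45))
sim$stroke <- rbinom(n, 1, plogis(-8 + 0.07 * sim$age +
                                  0.005 * sim$avg_glucose_level +
                                  0.8 * sim$hypertension))
model_terms <- forward_select(sim, c("age", "avg_glucose_level"),
                              c("hypertension", "bmi"))
print(model_terms)
```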

The final model from forward selection based on p-values contains the following attributes: age, hypertension, avg_glucose_level and heart_disease.

Since lasso, backward elimination and forward selection all yield different models, for the classification part we will stick to the results obtained via lasso regression and forward selection.

4. Prediction

Predict or classify one variable in your dataset from other variables. Motivate why you choose the outcome and its predictors. Evaluate the performance of your predictive model or the resulting classifier using appropriate metrics and visualizations. Remark: While the other parts of your presentation should nicely fit together as one whole, this prediction part might not fit as nicely, depending on your research question or data narrative. That is okay.

We have decided to implement 3 types of classifiers: logistic regression, decision trees and Naive Bayes. As stated previously, we will use the attributes discovered via both lasso regression and forward selection, so we will end up fitting 2 different logistic regressions, 2 different trees and 2 different Naive Bayes classifiers.

Logistic Regression

# LOGISTIC REGRESSION 1: BASED ON PARAMETERS DISCOVERED BY LASSO
#age, 
#hypertension, 
#heart_disease, 
#work_type, 
#avg_glucose_level and 
#smoking_status


log_reg1 <- glm(stroke ~ age + hypertension + heart_disease + work_type + avg_glucose_level + smoking_status,
                data = stroke,
                family = binomial(link = "logit"))

summary(log_reg1)
## 
## Call:
## glm(formula = stroke ~ age + hypertension + heart_disease + work_type + 
##     avg_glucose_level + smoking_status, family = binomial(link = "logit"), 
##     data = stroke)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1445  -0.2935  -0.1533  -0.0732   3.5399  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -7.283429   1.034054  -7.044 1.87e-12 ***
## age                          0.072774   0.006186  11.765  < 2e-16 ***
## hypertension                 0.531510   0.174168   3.052 0.002275 ** 
## heart_disease                0.348127   0.205533   1.694 0.090309 .  
## work_typeGovt_job           -0.704910   1.086388  -0.649 0.516431    
## work_typeNever_worked       -9.799263 308.993942  -0.032 0.974701    
## work_typePrivate            -0.546146   1.071986  -0.509 0.610422    
## work_typeSelf-employed      -0.969312   1.092229  -0.887 0.374830    
## avg_glucose_level            0.004713   0.001257   3.748 0.000178 ***
## smoking_statusnever smoked  -0.062403   0.186637  -0.334 0.738111    
## smoking_statussmokes         0.314113   0.229234   1.370 0.170602    
## smoking_statusUnknown       -0.272125   0.246510  -1.104 0.269632    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1728.4  on 4908  degrees of freedom
## Residual deviance: 1363.6  on 4897  degrees of freedom
## AIC: 1387.6
## 
## Number of Fisher Scoring iterations: 14
library(ROCR)
pred = predict(log_reg1, type="response")
predObj = prediction(pred, stroke$stroke)
rocObj = performance(predObj, measure="tpr", x.measure="fpr") 
aucObj = performance(predObj, measure="auc")
plot(rocObj, main = paste("Area under the curve:", round(aucObj@y.values[[1]] ,4)))

# Classify with a 0.5 threshold (vectorised, instead of looping over all 4909 rows)
pred2 <- factor(ifelse(pred > 0.5, 1, 0))

library(caret)
## Loading required package: lattice
confusionMatrix(pred2, factor(stroke$stroke))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4700  208
##          1    0    1
##                                           
##                Accuracy : 0.9576          
##                  95% CI : (0.9516, 0.9631)
##     No Information Rate : 0.9574          
##     P-Value [Acc > NIR] : 0.4902          
##                                           
##                   Kappa : 0.0091          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.000000        
##             Specificity : 0.004785        
##          Pos Pred Value : 0.957620        
##          Neg Pred Value : 1.000000        
##              Prevalence : 0.957425        
##          Detection Rate : 0.957425        
##    Detection Prevalence : 0.999796        
##       Balanced Accuracy : 0.502392        
##                                           
##        'Positive' Class : 0               
## 
## NOTE: The confusionMatrix function in the caret package uses other formulas than the ones in the textbook for PPV, NPV, Prevalence etc. The only ones consistent with the textbook are those for Sensitivity and Specificity. 
## If one desires, the following website can be consulted for the exact formulas for the metrics that this function uses: https://www.rdocumentation.org/packages/caret/versions/6.0-86/topics/confusionMatrix

We have set the threshold at 0.5 when deciding whether someone is classified as having had a stroke. The accuracy of this model is very high at more than 95%, the sensitivity is 100%, the positive predictive value is also above 95%, and the area under the curve is 0.8526. Nevertheless, these results are misleading and should not be taken at face value, because our dataset contains too few instances of people who have had a stroke (only about 200, roughly 4% of the observations). As the confusion matrix shows, essentially every individual was classified as not having had a stroke, simply because more than 95% of the dataset consists of people who did not have one; almost all of the real 1's were classified as 0 as well. This is very problematic.
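One common remedy, which we did not apply here but which makes the imbalance problem visible, is to lower the 0.5 cutoff so that the minority class is actually predicted. A minimal sketch (our own addition; it assumes the `log_reg1` model and the `stroke` data frame from the chunks above are still in scope):

```r
# Lowering the threshold trades overall accuracy for sensitivity on the
# minority (stroke = 1) class; at 0.5 almost nothing is predicted as 1.
pred <- predict(log_reg1, type = "response")
for (thr in c(0.5, 0.2, 0.1)) {
  cls <- ifelse(pred > thr, 1, 0)
  cat(sprintf("threshold %.1f: %4d predicted strokes, %3d of them true positives\n",
              thr, sum(cls == 1), sum(cls == 1 & stroke$stroke == 1)))
}
```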

We encounter the same problem with the second logistic model, based on the parameters discovered by forward selection:

# LOGISTIC REGRESSION 2: BASED ON PARAMETERS DISCOVERED BY STEPWISE REGRESSION: FORWARD

log_reg2 <- glm(stroke ~ age + hypertension + avg_glucose_level + heart_disease,
                data = stroke,
                family = binomial(link = "logit"))

summary(log_reg2)
## 
## Call:
## glm(formula = stroke ~ age + hypertension + avg_glucose_level + 
##     heart_disease, family = binomial(link = "logit"), data = stroke)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0995  -0.2940  -0.1599  -0.0778   3.5885  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.660740   0.387152 -19.787  < 2e-16 ***
## age                0.067547   0.005571  12.124  < 2e-16 ***
## hypertension       0.539613   0.173055   3.118 0.001820 ** 
## avg_glucose_level  0.004802   0.001255   3.828 0.000129 ***
## heart_disease      0.404298   0.203447   1.987 0.046895 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1728.4  on 4908  degrees of freedom
## Residual deviance: 1374.6  on 4904  degrees of freedom
## AIC: 1384.6
## 
## Number of Fisher Scoring iterations: 7
library(ROCR)
pred = predict(log_reg2, type="response")

predObj = prediction(pred, stroke$stroke)
rocObj = performance(predObj, measure="tpr", x.measure="fpr") 
aucObj = performance(predObj, measure="auc")
plot(rocObj, main = paste("Area under the curve:", round(aucObj@y.values[[1]] ,4)))

# Classify with a 0.5 threshold (vectorised, instead of looping over all 4909 rows)
pred2 <- factor(ifelse(pred > 0.5, 1, 0))

library(caret)
confusionMatrix(pred2,factor(stroke$stroke))
## Warning in confusionMatrix.default(pred2, factor(stroke$stroke)): Levels are not
## in the same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4700  209
##          1    0    0
##                                           
##                Accuracy : 0.9574          
##                  95% CI : (0.9514, 0.9629)
##     No Information Rate : 0.9574          
##     P-Value [Acc > NIR] : 0.5184          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9574          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.9574          
##          Detection Rate : 0.9574          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 

The AUC is slightly smaller, but the accuracy, sensitivity and positive predictive value are very similar. Again, we should not take these results at face value, because the model classified every single data point as 0.

Decision Tree

## DECISION TREE 1: USING THE PARAMETERS DISCOVERED BY LASSO
library(rpart)
library(rpart.plot)
fit <- rpart(stroke ~ age + hypertension + heart_disease + work_type  + avg_glucose_level + smoking_status,
             method ="class",
             data = stroke, 
             control=rpart.control(maxdepth=6,cp=0.001),
             parms=list(split='information'))
rpart.plot(fit, type=4, extra=2, clip.right.labs=FALSE, varlen=0, faclen=0)

pred <-predict(object = fit, stroke, type = "class")
t <- table(pred, stroke$stroke)
confusionMatrix(t)
## Confusion Matrix and Statistics
## 
##     
## pred    0    1
##    0 4693  197
##    1    7   12
##                                           
##                Accuracy : 0.9584          
##                  95% CI : (0.9525, 0.9639)
##     No Information Rate : 0.9574          
##     P-Value [Acc > NIR] : 0.3789          
##                                           
##                   Kappa : 0.0989          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.99851         
##             Specificity : 0.05742         
##          Pos Pred Value : 0.95971         
##          Neg Pred Value : 0.63158         
##              Prevalence : 0.95743         
##          Detection Rate : 0.95600         
##    Detection Prevalence : 0.99613         
##       Balanced Accuracy : 0.52796         
##                                           
##        'Positive' Class : 0               
## 

For this decision tree, the number of false positives is very low at 7, and the number of true positives is 12. Still, this is only a moderately good model in terms of performance, because the number of false negatives is high at 197: most actual strokes are missed.

In order to be classified as a positive, a person must be either:

The positive predictive value is 95.9% (quite high) and the detection rate is 95.6%.

NOTE: We have limited the depth of the tree to 6 to prevent overfitting. We also set the cp parameter to 0.001, so that any split must decrease the overall lack of fit by a factor of 0.001 before being attempted.

## DECISION TREE 2: USING THE PARAMETERS DISCOVERED BY FORWARD SELECTION BASED ON P-VALUE
library(rpart)
library(rpart.plot)
fit2 <- rpart(stroke ~ age + hypertension + heart_disease+ avg_glucose_level,
             method ="class",
             data = stroke, 
             control=rpart.control(maxdepth=5,cp=0.001),
             parms=list(split='information'))
rpart.plot(fit2, type=4, extra=2, clip.right.labs=FALSE, varlen=0, faclen=0)

pred <-predict(object = fit2, stroke, type = "class")
t <- table(pred, stroke$stroke)
confusionMatrix(t)
## Confusion Matrix and Statistics
## 
##     
## pred    0    1
##    0 4697  205
##    1    3    4
##                                           
##                Accuracy : 0.9576          
##                  95% CI : (0.9516, 0.9631)
##     No Information Rate : 0.9574          
##     P-Value [Acc > NIR] : 0.4902          
##                                           
##                   Kappa : 0.0344          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.99936         
##             Specificity : 0.01914         
##          Pos Pred Value : 0.95818         
##          Neg Pred Value : 0.57143         
##              Prevalence : 0.95743         
##          Detection Rate : 0.95681         
##    Detection Prevalence : 0.99857         
##       Balanced Accuracy : 0.50925         
##                                           
##        'Positive' Class : 0               
## 

For the second decision tree the situation is similar: the number of false positives is down to 3, but the number of true positives is only 4. Again, this is only a moderately good model in terms of performance, because the number of false negatives is high at 205.

In order to be classified as a positive, a person must be older than 68, with an average glucose level between 127 and 132.

The positive predictive value is 95.8% (quite high) and the detection rate is 95.6%.

NOTE: We have limited the depth of the tree to 5 to prevent overfitting. We also set the cp parameter to 0.001, so that any split must decrease the overall lack of fit by a factor of 0.001 before being attempted.

Naive Bayes

Our next option was a Naive Bayes classifier, in the hope of obtaining more reliable results:

library(e1071)
naive1 <- naiveBayes(stroke ~ age + hypertension + heart_disease + work_type  + avg_glucose_level + smoking_status, 
           data = stroke, 
           laplace = 0.01, 
           na.action = na.pass)

# NOTE: We CAN plug numerical variables into the Naive Bayes classifier, not only categorical ones.
# According to the documentation, the naiveBayes function models each numerical variable as a
# Gaussian, using its per-class mean and standard deviation when computing probability scores.

library(caTools)
pred <-predict(object = naive1, stroke, type = "class")
t <- table(pred, stroke$stroke)
library(caret,quietly = TRUE)
confusionMatrix(t)
## Confusion Matrix and Statistics
## 
##     
## pred    0    1
##    0 4214  121
##    1  486   88
##                                           
##                Accuracy : 0.8763          
##                  95% CI : (0.8668, 0.8854)
##     No Information Rate : 0.9574          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1732          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8966          
##             Specificity : 0.4211          
##          Pos Pred Value : 0.9721          
##          Neg Pred Value : 0.1533          
##              Prevalence : 0.9574          
##          Detection Rate : 0.8584          
##    Detection Prevalence : 0.8831          
##       Balanced Accuracy : 0.6588          
##                                           
##        'Positive' Class : 0               
## 

The results from the first Naive Bayes classifier are indeed far more informative. The model did classify some instances as 1, with 121 false negatives and 88 true positives. The accuracy is high at 87.6%, as is the positive predictive value at 97%, and the detection rate is 85%. All in all, even though its raw accuracy is lower, this is a much better classifier than logistic regression.
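The note in the chunk above about numerical predictors can be illustrated by hand: per class, the Gaussian model for a numeric variable uses that class's mean and standard deviation. A base-R toy sketch (our own addition, not part of the original analysis):

```r
# Toy sketch: P(x | class) for a numeric predictor is modelled as
# dnorm(x, class mean, class sd); the class with the larger likelihood
# (times its prior) wins.
set.seed(1)
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 3))  # class 0 around 0, class 1 around 3
y <- rep(c(0, 1), each = 50)
lik0 <- dnorm(2.5, mean(x[y == 0]), sd(x[y == 0]))
lik1 <- dnorm(2.5, mean(x[y == 1]), sd(x[y == 1]))
# With equal priors, the posterior odds for class 1 at x = 2.5 are lik1 / lik0,
# so an observation at 2.5 is assigned to class 1.
lik1 / lik0
```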

We also implemented another Naive Bayes classifier based on the minimal model discovered through forward selection based on p-values:

naive2 <- naiveBayes(stroke ~ age + hypertension + avg_glucose_level + heart_disease, 
                     data = stroke, 
                     laplace = 0.01, 
                     na.action = na.pass)
pred <-predict(object = naive2, stroke, type = "class")
t <- table(pred, stroke$stroke)
confusionMatrix(t)
## Confusion Matrix and Statistics
## 
##     
## pred    0    1
##    0 4255  124
##    1  445   85
##                                           
##                Accuracy : 0.8841          
##                  95% CI : (0.8748, 0.8929)
##     No Information Rate : 0.9574          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.18            
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9053          
##             Specificity : 0.4067          
##          Pos Pred Value : 0.9717          
##          Neg Pred Value : 0.1604          
##              Prevalence : 0.9574          
##          Detection Rate : 0.8668          
##    Detection Prevalence : 0.8920          
##       Balanced Accuracy : 0.6560          
##                                           
##        'Positive' Class : 0               
## 

This model is comparable to the previous one: it yielded 124 false negatives and 85 true positives. The accuracy is 88.4%, the positive predictive value is 97%, and the detection rate is slightly higher at 86.68%.

5. Conclusion

A brief summary of your findings from the previous sections without repeating your statements from earlier as well as a discussion of what you have learned about the data and your research question(s). You should also discuss any shortcomings of your current study, either due to data collection or methodology, and include ideas for possible future research.

Findings:

Future Directions: